fix(gpu): add WSL2 GPU support via CDI mode #411

Closed
tyeth-ai-assisted wants to merge 3 commits into NVIDIA:main from tyeth-ai-assisted:fix/wsl2-gpu-support

@tyeth-ai-assisted

Summary

  • Detect WSL2 at gateway startup (/dev/dxg present) and automatically configure CDI-based GPU injection
  • Fixes the complete nvidia-device-plugin failure chain on WSL2: NFD can't see PCI, NVML can't init without libdxcore.so, CDI spec missing per-GPU UUID entries
  • All changes are in cluster-entrypoint.sh — no Rust, Dockerfile, or manifest changes needed
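
The detection in the first bullet can be sketched as a small shell guard. This is a minimal sketch, not the code from cluster-entrypoint.sh: the function names and the optional marker-path argument are hypothetical (the argument only exists so the check can be exercised off WSL2), and the `GPU_ENABLED` test mirrors the condition stated in this PR description.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; names are illustrative, not taken from the PR diff.

# is_wsl2 [MARKER]: succeeds when the WSL2 GPU device node exists.
# MARKER defaults to /dev/dxg; overriding it is a testing convenience only.
is_wsl2() {
  [ -e "${1:-/dev/dxg}" ]
}

# wsl2_gpu_setup_needed [MARKER]: succeeds only when GPU_ENABLED=true
# and the marker exists, so non-WSL2 hosts never enter the WSL2 path.
wsl2_gpu_setup_needed() {
  [ "${GPU_ENABLED:-false}" = "true" ] && is_wsl2 "${1:-/dev/dxg}"
}
```

An entrypoint could then wrap the WSL2-specific steps in `if wsl2_gpu_setup_needed; then ... fi`, leaving every other host on the existing code path.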

What it does

When GPU_ENABLED=true and /dev/dxg exists (WSL2), the entrypoint:

  1. Generates CDI spec via nvidia-ctk cdi generate (auto-detects WSL mode)
  2. Adds per-GPU UUID and index device entries (nvidia-ctk only generates name=all, but the device plugin assigns GPUs by UUID)
  3. Bumps CDI spec version from 0.3.0 to 0.5.0 (library minimum)
  4. Patches the spec to include libdxcore.so, working around an upstream nvidia-ctk bug (nvidia-container-toolkit#1739, "nvidia-ctk cdi generate: libdxcore.so not found on WSL2 despite being present")
  5. Switches nvidia-container-runtime from auto to cdi mode
  6. Deploys a k3s Job to label the node with pci-10de.present=true (NFD can't detect NVIDIA PCI on WSL2's virtualised bus)
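
Step 2 can be illustrated with a short shell sketch that turns `index, uuid` pairs into extra CDI device entries. This is an assumption-laden illustration rather than the PR's code: the function name is hypothetical, and it assumes each per-GPU entry only needs the `/dev/dxg` device node, mirroring what the `name=all` entry is assumed to carry in WSL mode.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "index, uuid" lines on stdin and print one CDI
# device entry per index and one per UUID.
# Assumption: on WSL2 every GPU is reached through the single /dev/dxg node,
# so each entry repeats that device node.
emit_gpu_devices() {
  local index uuid name
  # IFS splits on the comma plus surrounding spaces, e.g. "0, GPU-abc".
  while IFS=', ' read -r index uuid || [ -n "$index" ]; do
    [ -n "$uuid" ] || continue   # skip blank or malformed lines
    for name in "$index" "$uuid"; do
      printf -- '- name: "%s"\n' "$name"
      printf '  containerEdits:\n'
      printf '    deviceNodes:\n'
      printf '    - path: /dev/dxg\n'
    done
  done
}
```

The input format matches what `nvidia-smi --query-gpu=index,uuid --format=csv,noheader` prints, so that command could be piped into the function. Splicing the output into the `devices:` list of the generated spec is left out of this sketch.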

On non-WSL2 hosts, the new code path is never entered (/dev/dxg doesn't exist).
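
Steps 3 and 5 above amount to two in-place file edits, which might look like the following. This is a sketch under stated assumptions, not the PR's implementation: the function names are hypothetical, GNU `sed -i` is assumed, the spec is assumed to carry a top-level `cdiVersion:` line, and the runtime config is assumed to contain a `mode = "auto"` line.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; GNU sed is assumed for the -i flag.

# bump_cdi_version SPEC: rewrite the top-level cdiVersion field to 0.5.0.
bump_cdi_version() {
  sed -i 's/^cdiVersion:.*/cdiVersion: 0.5.0/' "$1"
}

# set_runtime_mode_cdi CONFIG: change mode = "auto" to mode = "cdi",
# preserving whatever indentation and spacing the line already has.
set_runtime_mode_cdi() {
  sed -i 's/^\([[:space:]]*mode[[:space:]]*=[[:space:]]*\)"auto"/\1"cdi"/' "$1"
}
```

Typical arguments would be the generated spec (for example `/etc/cdi/nvidia.yaml`) and the runtime config (for example `/etc/nvidia-container-runtime/config.toml`); both paths are assumptions here, since the PR description does not name them.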

Testing

Verified on:

  • Hardware: Framework 16 laptop, AMD CPU, NVIDIA RTX 5070 (8GB VRAM) + 96GB DDR5 shared
  • OS: WSL2 (Linux 6.6.87.2-microsoft-standard-WSL2)
  • Driver: NVIDIA 595.71, CUDA 13.2
  • Result: nvidia-device-plugin 1/1 Running, nvidia.com/gpu: 1 advertised, nvidia-smi works inside sandbox pods, full NemoClaw onboard + sandbox creation + local inference (ollama nemotron 70B) working end-to-end

Agent Investigation

Diagnosed using openshell doctor commands. Full diagnostic chain documented in #404.

🤖 Generated with Claude Code

tyeth added 3 commits March 17, 2026 20:09
…chart

WSL2 virtualises GPU access through /dev/dxg instead of native /dev/nvidia*
device nodes, which breaks the entire NVIDIA k8s device plugin detection
chain. Three changes fix this:

1. Detect WSL2 in cluster-entrypoint.sh and configure CDI mode:
   - Generate CDI spec with nvidia-ctk (auto-detects WSL mode)
   - Patch the spec to include libdxcore.so (nvidia-ctk bug omits it)
   - Switch nvidia-container-runtime from auto to cdi mode
   - Deploy a job to label the node with pci-10de.present=true
     (NFD can't see NVIDIA PCI on WSL2's virtualised bus)

2. Bundle the nvidia-device-plugin Helm chart in the cluster image
   instead of fetching from the upstream GitHub Pages repo at startup.
   The repo URL (nvidia.github.io/k8s-device-plugin/index.yaml)
   currently returns 404.

3. Update the HelmChart CR to reference the bundled local chart
   tarball via the k3s static charts API endpoint.

Closes NVIDIA#404

The upstream Helm repo URL works fine; remove the unnecessary chart
bundling and local reference changes.

WSL2 virtualises GPU access through /dev/dxg instead of native /dev/nvidia*
device nodes, which breaks the entire NVIDIA k8s device plugin detection
chain. This patch detects WSL2 at container startup and applies fixes:

1. Generate CDI spec with nvidia-ctk (auto-detects WSL mode)
2. Add per-GPU UUID and index device entries to CDI spec (nvidia-ctk
   only generates name=all but the device plugin assigns GPUs by UUID)
3. Bump CDI spec version from 0.3.0 to 0.5.0 (library minimum)
4. Patch the spec to include libdxcore.so (nvidia-ctk bug omits it;
   this library bridges Linux NVML to the Windows DirectX GPU Kernel)
5. Switch nvidia-container-runtime from auto to cdi mode
6. Deploy a job to label the node with pci-10de.present=true
   (NFD can't see NVIDIA PCI on WSL2's virtualised bus)

Closes NVIDIA#404
@tyeth-ai-assisted tyeth-ai-assisted requested a review from a team as a code owner March 17, 2026 22:03
@github-actions

Thank you for your interest in contributing to OpenShell, @tyeth-ai-assisted.

This project uses a vouch system for first-time contributors. Before submitting a pull request, you need to be vouched by a maintainer.

To get vouched:

  1. Open a Vouch Request discussion.
  2. Describe what you want to change and why.
  3. Write in your own words — do not have an AI generate the request.
  4. A maintainer will comment /vouch if approved.
  5. Once vouched, open a new PR (preferred) or reopen this one after a few minutes.

See CONTRIBUTING.md for details.

@github-actions

Thank you for your submission! We ask that you sign our Developer Certificate of Origin before we can accept your contribution. You can sign the DCO by adding a comment below using this text:


I have read the DCO document and I hereby sign the DCO.


You can retrigger this bot by commenting recheck in this Pull Request. Posted by the DCO Assistant Lite bot.

@github-actions github-actions bot closed this Mar 17, 2026
@tyeth

tyeth commented Mar 17, 2026

I have read the DCO document and I hereby sign the DCO.

@tyeth-ai-assisted
Author

I have read the DCO document and I hereby sign the DCO.

tyeth-ai-assisted pushed a commit to tyeth-ai-assisted/NemoClaw that referenced this pull request Mar 17, 2026
WSL2 GPU support:
- Add wsl2-gpu-fix.sh that applies CDI mode, libdxcore.so injection,
  and node labeling after gateway start (workaround until OpenShell
  ships native WSL2 support via NVIDIA/OpenShell#411)
- Hook it into both onboard.js (interactive wizard) and setup.sh
  (legacy script) so it runs automatically after gateway creation
- Writes a complete CDI spec from scratch instead of fragile sed
  patching of the nvidia-ctk generated spec

Ollama on Linux:
- setup.sh only created the ollama-local provider on macOS (Darwin)
- Now detects ollama on any platform (Linux/WSL2 included)
- Enables local GPU inference via ollama for WSL2 users

Closes NVIDIA/NemoClaw#TBD
See also: NVIDIA/OpenShell#404, NVIDIA/OpenShell#411

Development

Successfully merging this pull request may close these issues.

bug: GPU passthrough fails on WSL2 — NVML init fails without CDI mode and libdxcore.so

2 participants